ARTICLE INFO

Article history:
Received 9 December 2009
Received in revised form 30 January 2010

Keywords:
Levenshtein distance


ABSTRACT

The idea of measuring distance between languages seems to have its roots in the work of the French explorer
Dumont D’Urville (1832) [13]. He collected comparative word lists for various languages during his voyages aboard
the Astrolabe from 1826 to 1829 and, in his work concerning the geographical division of the Pacific, he proposed a
method for measuring the degree of relation among languages. The method used by modern glottochronology, developed
by Morris Swadesh in the 1950s, measures distances from the percentage of shared cognates, which are words with a
common historical origin. Recently, we proposed a new automated method which uses the normalized Levenshtein
distances among words with the same meaning and averages on the words contained in a list. Recently another group of
scholars, Bakker et al. (2009) [8] and Holman et al. (2008) [9], proposed a refined version of our definition including
a second normalization. In this paper we compare the information content of our definition with the refined version in
order to decide which of the two can be applied with greater success to resolve relationships among languages.

© 2010 Elsevier B.V. All rights reserved.


1. Introduction 


Glottochronology tries to estimate the time at which languages diverged with the implicit assumption that vocabularies 
change at a constant average rate. The idea is to consider the percentage of shared cognates in order to compute the distance
between pairs of languages [1]. These lexical distances are assumed to be, on average, logarithmically proportional to 
divergence times. In fact, changes in vocabulary accumulate year after year and two languages initially similar become more 
and more different. Recent examples of the use of Swadesh lists and cognates to construct language trees are the studies of 
Gray and Atkinson [2] and Gray and Jordan [3]. 

We recently proposed an automated method which uses Levenshtein distances among words in a list [4,5] (another 
automated method used to compare dialects already exists but uses a different normalization of the Levenshtein distance [6] 
based on the length of the alignment). To be precise, we defined the lexical distance between two languages by considering 
a normalized Levenshtein distance among words with the same meaning and averaging on all the words contained in 
a Swadesh list. The normalization is extremely important and no reasonable results can be found without it. Then, we 
transformed the lexical distances into separation times. This goal was reached using a logarithmic rule which is the analogue 
of the adjusted fundamental formula of glottochronology [7]. Finally, the phylogenetic tree could be straightforwardly 
constructed. 

In Refs. [4,5] we tested our method by constructing the phylogenetic trees of the Indo-European and the Austronesian 
groups. 
At almost the same time, the above-described automated method was used and developed by another large group of
scholars [8,9]. They placed the method at the core of an ambitious project, the ASJP (The Automated Similarity Judgment 
Program). In their work they proposed a refinement of our definition including a second normalization in the definition of 
lexical distance. 

The goal of this paper is to compare the information contents of the two definitions in order to decide which of the two 
can be applied with greater success to resolve relationships among languages. 

Before tackling this problem we sketch our definition of lexical distance and the modification proposed in Refs. [8,9] 
which is a refinement including a second normalization. Then we compare the information content of the two definitions 
and give our conclusion. 


2. Lexical distance 


Our definition of lexical distance between two words is a variant of the Levenshtein distance, which is simply the
minimum number of insertions, deletions, or substitutions of a single character needed to transform one word into the
other. Our definition is taken as the Levenshtein distance divided by the number of characters of the longer of the two
compared words. More precisely, given two words $\alpha_i$ and $\beta_j$, their lexical distance $D(\alpha_i, \beta_j)$ is given by

$$D(\alpha_i, \beta_j) = \frac{D_l(\alpha_i, \beta_j)}{L(\alpha_i, \beta_j)} \tag{1}$$

where $D_l(\alpha_i, \beta_j)$ is the Levenshtein distance between the two words and $L(\alpha_i, \beta_j)$ is the number of characters
of the longer of the two words $\alpha_i$ and $\beta_j$. Therefore, the distance can take any value between 0 and 1. Obviously
$D(\alpha_i, \alpha_i) = 0$.
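
The word-level distance just defined can be sketched in a few lines of Python. This is a minimal illustration, assuming words are plain strings in which every character counts as one symbol; the names `levenshtein` and `word_distance` are chosen here and are not taken from Refs. [4,5], and returning 0 for two empty strings is a convention of this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the edit distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (no cost if equal)
            ))
        previous = current
    return previous[-1]


def word_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption of this sketch: two empty strings are identical
    return levenshtein(a, b) / longest
```

For instance, `word_distance("milk", "milch")` evaluates to 2/5 = 0.4: one substitution and one insertion, divided by the five characters of the longer word.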

The normalization is an important novelty and it plays a crucial role; no sensible results can be found without it [4,5]. 

We use distance between pairs of words, as defined above, to construct the lexical distances between languages. For 
any pair of languages, the first step is to compute the distance between words corresponding to the same meaning in the 
Swadesh list. Then, the lexical distance between each language pair is defined as the average of the distances between all 
words [4,5]. As a result we have a number between 0 and 1 which we claim to be the lexical distance between two languages. 

Assume that the number of languages is N and the list of words for any language contains M items. Any language in the
group is labeled with a Greek letter (say $\alpha$) and any word of that language by $\alpha_i$ with $1 \le i \le M$. Then, two
words $\alpha_i$ and $\beta_j$ in the languages $\alpha$ and $\beta$ have the same meaning if $i = j$.

Then the distance between two languages is

$$D(\alpha, \beta) = \frac{1}{M} \sum_{i} D(\alpha_i, \beta_i) \tag{2}$$

where the sum goes from 1 to M. Notice that only pairs of words with the same meaning are used in this definition. This
number is in the interval [0, 1]. Obviously $D(\alpha, \alpha) = 0$.
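
The averaging step can be sketched as follows, assuming each language is given as a Python list of M strings ordered so that position i holds the word for meaning i. The function names are illustrative only, and the compact `word_distance` helper is one possible implementation of the normalized Levenshtein distance between two words.

```python
from statistics import mean


def word_distance(a: str, b: str) -> float:
    """Normalized Levenshtein distance: edit distance / length of the longer word."""
    if not a and not b:
        return 0.0  # convention of this sketch for two empty strings
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1] / max(len(a), len(b))


def language_distance(list_a: list[str], list_b: list[str]) -> float:
    """Average word distance over pairs of words with the same meaning (same index)."""
    if len(list_a) != len(list_b):
        raise ValueError("both word lists must contain the same M meanings")
    return mean(word_distance(a, b) for a, b in zip(list_a, list_b))
```

With the three-item toy lists `["water", "hand", "night"]` and `["wasser", "hand", "nacht"]` (ordinary spellings used purely as an illustration, not entries from the database of Ref. [10]) the word distances are 2/6, 0 and 2/5, so the sketch returns about 0.244.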

The result of the analysis is an N × N upper triangular matrix whose entries are the N(N − 1)/2 non-trivial lexical distances
$D(\alpha, \beta)$ between all pairs in a group. Indeed, our method for computing distances is a very simple operation that does not
need any specific linguistic knowledge and requires a minimum of computing time.
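
As a rough illustration of how the pairwise distances might be collected, the sketch below stores one entry per unordered pair of languages, which is the content of the upper triangle without the diagonal. It assumes the word lists are held in a dictionary keyed by language name; `pairwise_distances` is a name chosen here, and `mismatch_fraction` is a hypothetical stand-in used only to make the example runnable, not the lexical distance of this paper.

```python
from itertools import combinations


def pairwise_distances(word_lists, distance):
    """Return {(name_a, name_b): distance} for every unordered pair of languages."""
    return {
        (a, b): distance(word_lists[a], word_lists[b])
        for a, b in combinations(sorted(word_lists), 2)
    }


def mismatch_fraction(list_a, list_b):
    """Hypothetical stand-in distance: share of meanings whose word forms differ."""
    return sum(a != b for a, b in zip(list_a, list_b)) / len(list_a)


# Toy data, invented for the example: three "languages" with two meanings each.
toy = {"A": ["x", "y"], "B": ["x", "z"], "C": ["w", "z"]}
upper_triangle = pairwise_distances(toy, mismatch_fraction)
```

Any function that maps two word lists to a number can be passed as `distance`, including an implementation of the averaged normalized Levenshtein distance.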

A phylogenetic tree could be constructed from the matrix of lexical distances $D(\alpha, \beta)$, but this would only give the
topology of the tree and the absolute time scale would be missing. Therefore, we perform [4,5] a logarithmic transformation
of lexical distances which is analogous to using the adjusted fundamental formula of glottochronology [7]. In this way
we obtain a new N × N upper triangular matrix whose entries are the times of divergence between all pairs of languages.
This matrix preserves the topology of the lexical distance matrix but it also contains the information concerning absolute 
time scales. Then, the phylogenetic tree can be straightforwardly constructed. 

In Refs. [4,5] we tested our method by constructing the phylogenetic trees of the Indo-European group and of the 
Austronesian group. In both cases we considered N = 50 languages. The database [10] that we used in Refs. [4,5] consists
of M = 200 words for each language. The main source for the database for the Indo-European group is the file prepared by
Dyen et al. in Ref. [11]. For the Austronesian group we used as the main source the lists contained in the huge database in 
Ref. [12]. 


3. A second normalization


A further modification has been proposed in Refs. [8,9] in order to avoid possible similarities which could arise from
accidental relative phonological similarities of languages.

Let us first define the global distance between languages $\alpha$ and $\beta$ as

$$\Gamma(\alpha, \beta) = \frac{1}{M(M-1)} \sum_{i \neq j} D(\alpha_i, \beta_j) \tag{3}$$

where the sum is over all M(M − 1) pairs of words corresponding to different meanings in the two lists ($M^2$ is the total
number of pairs and M is the number of pairs with the same meaning).
This quantity measures a distance of the vocabulary of the two languages, without comparing words with the same
meaning. In other words, it only accounts for general similarities in the frequency and ordering of characters. The point is
that, at this stage, we do not know whether $\Gamma(\alpha, \beta)$ carries information or only depends on accidental similarities.

Assuming the point of view of Refs. [8,9], it is reasonable to define a bi-normalized lexical distance as follows:

$$D_s(\alpha, \beta) = \frac{D(\alpha, \beta)}{\Gamma(\alpha, \beta)} \tag{4}$$

Notice that while by definition $D(\alpha, \alpha) = D_s(\alpha, \alpha) = 0$, in all real cases $\Gamma(\alpha, \alpha) \neq 0$.

This second normalization should cancel the effects of accidental phonological similarities between the two languages.
The idea was to avoid a situation where unrelated languages that happen to have similar sound structures (e.g., Finnish and
Japanese) for that reason alone get classified together. It was assumed that eliminating $\Gamma$ would have a positive effect on the
classification of sets of languages that included unrelated ones and would have no appreciable effect on the classification of
languages that are related.

We think that the idea of the proposed second normalization turns out to be correct only if $\Gamma(\alpha, \beta)$ is uncorrelated
with the lexical distance between languages $\alpha$ and $\beta$. In this case, in fact, it carries no information concerning their
relationship. In contrast, if it is positively correlated with the distance between the two languages, one can conclude that it
may contain some information that could be usefully exploited. Nevertheless, the information contained in it could also be
useless in establishing phylogenetic relations. This point will be further discussed in the conclusions.
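
The global distance and the bi-normalized distance of this section can be sketched together. The code assumes two word lists of equal length M ≥ 2, aligned by meaning, and a non-zero global distance; all function names are illustrative and are not taken from Refs. [8,9] or from the ASJP software.

```python
def word_distance(a: str, b: str) -> float:
    """Normalized Levenshtein distance: edit distance / length of the longer word."""
    if not a and not b:
        return 0.0  # convention of this sketch for two empty strings
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1] / max(len(a), len(b))


def language_distance(list_a, list_b):
    """Average distance over the M pairs of words with the same meaning."""
    return sum(word_distance(a, b) for a, b in zip(list_a, list_b)) / len(list_a)


def global_distance(list_a, list_b):
    """Average distance over the M(M - 1) pairs of words with different meanings."""
    m = len(list_a)
    total = sum(
        word_distance(list_a[i], list_b[j])
        for i in range(m)
        for j in range(m)
        if i != j
    )
    return total / (m * (m - 1))


def binormalized_distance(list_a, list_b):
    """Lexical distance divided by the global distance (the second normalization)."""
    return language_distance(list_a, list_b) / global_distance(list_a, list_b)
```

Comparing a list with itself gives a lexical distance of 0 while the global distance stays positive as long as the list contains different words, which matches the remark made above for real data.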


4. Comparison of different definitions 


In order to decide which definition is better to use, $D(\alpha, \beta)$ or $D_s(\alpha, \beta)$, we have to see whether $\Gamma(\alpha, \beta)$ is positively
correlated with these distances. If it is not, we will decide to use $D_s(\alpha, \beta)$ since we eliminate errors due to accidental
similarities between vocabularies. In contrast, if it is positively correlated, we would conclude that $\Gamma(\alpha, \beta)$ carries some
positive information about the degree of similarity of the two languages. In this second case, two languages will be, on
average, closer for smaller $\Gamma(\alpha, \beta)$, and we would decide to use $D(\alpha, \beta)$ since it incorporates the information contained in
$\Gamma(\alpha, \beta)$.

In order to compute the correlation between distance and $\Gamma(\alpha, \beta)$ we proceed as follows: first we define, for a generic
function $f(\alpha, \beta)$, the average $\langle f \rangle$ over all possible values of $\alpha$ and $\beta$ as follows:

$$\langle f \rangle = \frac{1}{N^2} \sum_{\alpha, \beta} f(\alpha, \beta) \tag{5}$$

which is the average value of the function $f(\alpha, \beta)$ in a linguistic group. Then, we define the correlation between $D(\alpha, \beta)$
and $\Gamma(\alpha, \beta)$ in a standard way as

$$C(\Gamma, D) = \frac{\langle \Gamma D \rangle - \langle \Gamma \rangle \langle D \rangle}{\left( \langle (\Gamma - \langle \Gamma \rangle)^2 \rangle \, \langle (D - \langle D \rangle)^2 \rangle \right)^{1/2}} \tag{6}$$

The result is that the correlation in the Indo-European group is 0.59173 while in the Austronesian group it is 0.46032.
In both cases it is a quite high positive value (correlation may take any value between −1 and 1) and we conclude that
possible vocabulary similarities accounted for by $\Gamma(\alpha, \beta)$ carry information and are not at all accidental. The weak point is
that we have checked the correlations against $D(\alpha, \beta)$ which, at least from the point of view of the proponents of the second
normalization, linearly incorporates $\Gamma(\alpha, \beta)$ since $D(\alpha, \beta) = \Gamma(\alpha, \beta) D_s(\alpha, \beta)$.

From this point of view our result is not so astonishing. Nevertheless, we can also compute the correlation between the
bi-normalized distance $D_s(\alpha, \beta)$ and $\Gamma(\alpha, \beta)$. The definition is the same as (6) with $D_s$ substituting for $D$. We obtain that
the correlation $C(\Gamma, D_s)$ in the Indo-European group is 0.54713 while in the Austronesian group it is 0.40169. These two
numbers, although slightly smaller than the previous ones, are still quite high and confirm that $\Gamma(\alpha, \beta)$ contains positive
information. In other words, closer languages, both in the sense of a smaller $D(\alpha, \beta)$ and a smaller $D_s(\alpha, \beta)$, will on average
have smaller $\Gamma(\alpha, \beta)$.

We remark that the same correlation coefficients, both for $D$ and for $D_s$, come out if the average (5) is computed neglecting
the pairs where the same Greek index is repeated (that is, $\alpha = \beta$).

In order to complete our analysis we plot, just for the Austronesian group, $\Gamma(\alpha, \beta)$ as a function of $D(\alpha, \beta)$
(Fig. 1, left) and as a function of $D_s(\alpha, \beta)$ (Fig. 1, right). Each point in the figures represents a pair of languages. In both cases
we perceive the positive correlation which is evidenced by the best linear fits.

We remark that the points which lie on the vertical axes at the distance 0 value correspond, in both figures, to pairs for
which the same language is compared. For these points the $D(\alpha, \alpha) = D_s(\alpha, \alpha)$ are all equal to 0 while the $\Gamma(\alpha, \alpha)$ are
positive. It is easy to see that the self-distances accounted for by the $\Gamma(\alpha, \alpha)$, which compare words with different meanings in
the same language, are, on average, smaller than the $\Gamma(\alpha, \beta)$ which compare words with different meanings in two different
languages. This fact confirms that the information carried by $\Gamma(\alpha, \beta)$ is positive.

In other words, more closely related languages not only have more similar words corresponding to the same meaning,
but also have more similar general occurrence and ordering of characters in words.

Fig. 1. Global distance $\Gamma(\alpha, \beta)$ as a function of lexical distance $D(\alpha, \beta)$ (left) and as a function of bi-normalized distance $D_s(\alpha, \beta)$ (right) for Austronesian
languages. The positive correlations are evidenced by the best linear fits. The points which lie on the vertical axes at the distance 0 value correspond to
pairs for which the same language is compared. For these points $D(\alpha, \alpha) = D_s(\alpha, \alpha) = 0$ while $\Gamma(\alpha, \alpha) \neq 0$.
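
The correlation used in this section can be sketched as below. The code assumes that the global distances and the lexical distances are available as two full N × N matrices (lists of lists, diagonal included) and that the average runs over all N² ordered pairs; the function names are illustrative, and the toy matrices in the example are invented, not data from this study.

```python
from math import sqrt


def average(matrix):
    """Average of f(alpha, beta) over all N*N entries of a square matrix."""
    values = [x for row in matrix for x in row]
    return sum(values) / len(values)


def correlation(gamma, dist):
    """Correlation coefficient between two N x N matrices, taken entry by entry."""
    n = len(gamma)
    if len(dist) != n or any(len(r) != n for r in gamma) or any(len(r) != n for r in dist):
        raise ValueError("both inputs must be N x N matrices of the same size")
    mean_g, mean_d = average(gamma), average(dist)
    product = [[g * d for g, d in zip(rg, rd)] for rg, rd in zip(gamma, dist)]
    covariance = average(product) - mean_g * mean_d
    var_g = average([[(g - mean_g) ** 2 for g in row] for row in gamma])
    var_d = average([[(d - mean_d) ** 2 for d in row] for row in dist])
    # A constant matrix has zero variance and makes the ratio undefined.
    return covariance / sqrt(var_g * var_d)


# Invented 2 x 2 example in which the two quantities rise and fall together.
toy_gamma = [[1.0, 2.0], [2.0, 1.0]]
toy_dist = [[0.0, 1.0], [1.0, 0.0]]
toy_correlation = correlation(toy_gamma, toy_dist)
```

Passing the matrix of bi-normalized distances instead of the lexical distances as the second argument gives the corresponding coefficient for the refined definition.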


5. Conclusions 


In this work we have analyzed two different possibilities for the definition of automated language distance. More 
precisely, starting from a Levenshtein distance, we have analyzed two possible normalizations. The comparison between 
them is only made by using statistical arguments. 

Our analysis clearly shows that more closely related languages have smaller global distance. This means that not only do 
they have more similar words for the same meaning, but also they have more similar general occurrence and ordering of 
characters in words. 

We would like to conclude that it is preferable to use the single-normalization definition of distance $D(\alpha, \beta)$, since
otherwise a part of the information about affinities of languages is lost. Nevertheless, this conclusion should be empirically 
supported by comparing the distance matrices corresponding to the two definitions with matrices produced by experts. 

At this stage, we have clearly shown that global distances contain some information about language relationships. 
Nevertheless, we are not completely sure that this information is really relevant when constructing phylogenetic trees and 
this remains an open question. 